Abstract

Many real world, complex phenomena have underlying structures of evolving networks where nodes and links are added and 
removed over time. A central scientific challenge is the description and explanation of network dynamics, with a key test being the 
prediction of short and long term changes. For the problem of short-term link prediction, existing methods attempt to determine 
neighborhood metrics that correlate with the appearance of a link in the next observation period. Recent work has suggested that 
the incorporation of user-specific metadata and usage patterns can improve link prediction; however, methodologies for doing so
in a systematic way are largely unexplored in the literature. Here, we provide an approach to predicting future links by applying 
an evolutionary algorithm to weights which are used in a linear combination of sixteen neighborhood and node similarity indices. 
We examine Twitter reciprocal reply networks constructed at the time scale of weeks, both as a test of our general method and as 
a problem of scientific interest in itself. Our evolved predictors exhibit a thousand-fold improvement over random link prediction 
with high levels of precision for the top twenty predicted links, to our knowledge strongly outperforming all extant methods. Based 
on our findings, we suggest possible factors which may be driving the evolution of Twitter reciprocal reply networks. 

Keywords: algorithms, data mining, link prediction, social networks, Twitter, complex networks, complex systems

1. Introduction

Time varying social networks can be used to model groups 
whose dynamics change over time. Individuals, represented by 
nodes, may enter or exit the network, while interactions, rep- 
resented by links, may strengthen or weaken. Most network 
growth models capture global properties, but do not capture 
specific localized dynamics such as who will be connected to 
whom in the future. And yet, it is precisely this type of infor- 
mation that would be most valuable in applications such as na- 
tional security, online social networking sites (people you may 
know), and organizational studies (predicting potential collab- 
orators). 

In this paper, we focus primarily on the link prediction
problem: given a snapshot of a network $G_t = (V, E_t)$, with nodes
$V$ (nodes present across all time steps) and links $E_t$ at time $t$,
we seek to predict the most likely links to occur in the next
time step, $t + 1$ [1].

Link prediction strategies may be broadly categorized into 
three groups: similarity based strategies, maximum likelihood 
algorithms, and probabilistic models. As noted by Lu et al. [2],
the latter two approaches can be prohibitively time consuming
for a large network (> 10,000 nodes). Given our interest in
large, sparse networks, we focus primarily on local information
about user-user connections and use similarity indices to
characterize the likelihood of future interactions. We consider the
two major classes of similarity indices: topological and
user-specific (Table 1).

There does not appear to be one best similarity index that is 
superior in all settings. Depending on the network under anal- 
ysis, various measures have shown to be particularly promis- 
ing |T1|3HE1. These findings suggest that the predictors which 
work "best" for a given network may be related to the inherent 
structure within the individual network rather than a universal 
best set of predictors. Further, it is also plausible that the best 
link predictor may change as the network responds to endoge- 
nous and exogenous factors driving its evolution. 

Topological similarity indices encode information about the 
relative overlap between nodes' neighborhoods. We expect that 
the more "similar" two nodes' topological neighborhoods are 
(e.g., the more overlap in their shared friends), the more likely 
they may be to exhibit a future link. The common neighbors 
index, a building block of many other topological similarity in- 
dices, has been shown to correlate with the occurrence of future 
links [9]. Several variants of this index have been proposed and
have been shown to be useful for link prediction in a variety of 
settings (see (T\ for a review). 

In their seminal paper on link prediction, Liben-Nowell and
Kleinberg [1] examined author collaboration networks derived
from arXiv submissions in four subfields of Physics. They
found that neighborhood similarity measures, such as the
Jaccard [10], Adamic-Adar [11], and Katz [12] coefficients,
provided a large factor improvement over randomly predicted
links. Various other neighborhood similarity indices have been
used to predict missing or future links [3, 4, 7, 13, 14].

Figure 1: A visualization of a one week Twitter reciprocal reply network exhibiting interactions between a core of 25,936 users who were active in each of the networks
in the period from September 9, 2008 to October 20, 2008. Note the large degree observed in one community (inset). The colors indicate modularity, a proxy for
community structure, as detected by Gephi's implementation of Blondel's "Fast unfolding of communities in large networks" [15].

As a complement to topological similarity indices,
user-specific similarity indices examine features of nodes (e.g.,
language, topical similarity, behavior). Several studies have
suggested that incorporating these measures can enhance link
prediction in social networks [2, 4, 16-19]. In an effort to
incorporate multiple indices into one model, some researchers have
used supervised learning (e.g., support vector machines [20],
decision trees [4], supervised random walks [6], multi-layer
perceptrons, and others) to train algorithms for link prediction.
Al Hasan et al. [W] use both topological and user- specific fea- 
tures to compare several supervised learning algorithms. They 
found that support vector machine (SVM) performed the best 
for the prediction of future links. While SVM is often consid- 
ered the state of the art supervised learning model, one of its 
major drawbacks relates to kernel selection [21].

Of particular interest, Wang et al. [4] study a network of
individuals constructed from mobile phone call data. They compare
similarity indices used in isolation to a link predictor
combining several indices (a binary decision tree determined from
supervised learning). These researchers found that the
combination of user-specific and topological similarity indices
outperforms topological indices in isolation. While their results are
promising, they acknowledge that this performance comes at the cost
of considering only a promising subset (e.g., 300 potential links which
have Adamic-Adar scores > 0.5 and spatial co-location rates
> 0.7) of the large set of user-user pairs two links
away (e.g., 266,750). We aim to provide a link predictor which
encompasses both topological and user-specific information,
without parametric thresholds.

In recent years, there has been a surge of interest in view- 
ing Twitter activity through the lens of social network analysis. 
In many studies, nodes represent individuals and links
represent following behavior [22-24], reciprocated following [25],
replies [19], or reciprocated replies [26].

Link prediction efforts related to Twitter have largely focused
on predicting follower relationships. Rowe, Stankovic, and
Alani [17] use supervised learning to combine topological and
individual-specific features (e.g., topics of tweets, tweet counts,
retweets) to predict following behavior. Romero and
Kleinberg also examined link prediction in follower networks
and suggest that directed closure plays an important role in the
formation of new links [27]. Hutto, Yardi, and Gilbert [18]
examine 507 individuals and their followers and find that
user-specific characteristics, such as message content and behavior,
should be given weight equal to that of topological characteristics
for link prediction. Yin, Hong, and Davison examine 979
individuals and their neighbors (in Twitter follower networks) to predict
following behavior over a 6-week time scale [8]. Golder et al.
examine Twitter users' desire to follow another user connected
by a path of length 2. They examine how shared interests and
reciprocated following relate to users' expressed interest in
making a new link (i.e., following) and suggest that mutuality
(reciprocated attention) is correlated with increased desire
to follow [28].

In this paper, we focus our attention on developing a
methodology for link prediction which is independent of network type
and requires no prior knowledge about the system under
analysis. We apply our technique to the link prediction problem for
a large, dynamic, social network: Twitter reciprocal reply
networks (RRNs), a construction first proposed by Bliss et al. [26].
We examine the evolution of these networks constructed at the
time scale of weeks, where nodes represent users and links
represent evidence of reciprocated replies during the time period
of analysis. While many other studies have examined
following and reciprocated following, we use reciprocated replies as
evidence of social interaction and active engagement of
individuals.^1

We explore how neighborhood similarity measures and
user-specific data can be combined as a weighted sum to compute
scores used in a link prediction tool. Rather than presupposing
that all similarity indices are of equal importance, we allow the
weights of this linear combination to adjust using an
evolutionary algorithm (Covariance Matrix Adaptation Evolution
Strategy, CMA-ES [29]).
This approach has the advantage of being able to detect which 
similarity indices are more salient predictors and allows for the 
incorporation of multiple similarity indices without any knowl- 
edge of the type of network one may be working with. Although 
we demonstrate sixteen similarity indices here, we emphasize 
that any other similarity indices may be interchanged for or 
added to the ones included in this study. The choice of which 
similarity measures to include will largely depend on available 
data (e.g., metadata for users in the context of the network one 
is studying) and the size of the network under consideration. 
The detection of indices which function as good predictors for 
future links can help to elucidate possible mechanisms which 
may drive the evolution of the network over time. 

Due to the large size of the networks that we seek to study and
the hypothesis that friends of friends are more likely to
become friends than individuals who have no friends in
common [30, 31], we restrict our attention to the prediction of
new links at time $t + 1$ which occur between individuals who
were separated by a path length of 2 at time $t$ (i.e., triadic
closure). Empirical evidence suggests that a preponderance of new
links form between such 2-link neighbors in email reply
networks [32], Twitter follower networks [27], and Twitter RRNs.^2

We organize our paper as follows: In Section 2, we describe
our data, the sixteen similarity indices, and the evolutionary
algorithm used for evolving the weights on these indices. In
Sections 3 and 4, we present our results, discuss the significance of
these findings, and suggest future directions for further work in
this area.



2. Methods 

2.1. Data 

Our data set consists of over 51 million tweets collected via
the Twitter gardenhose API service from September 9, 2008
to December 1, 2008. This collection represents roughly 40%
of all messages sent during this period (Table A1). Using the
criteria defined by Bliss et al. [26], we construct reciprocal reply
networks^3 as unweighted, undirected networks in which a link
exists between nodes u and v if and only if these individuals 
exhibit reciprocal replies during the week under analysis (Fig. 

Each of the reciprocal reply networks is placed into
either an early set or a late set. The early set consists of networks
constructed for each of the six weeks from September 9, 2008
to October 20, 2008, and the late set consists of networks
constructed for each of the six weeks from October 21, 2008 to
December 1, 2008. We find a core of 25,936 users who were
active in each of the networks in the early period
($V_{\text{early}} = \bigcap_i V_i$) and examine the induced subgraphs
on $V_{\text{early}}$ constructed at the time scale of weeks. Similarly,
we find a core of 44,439 users who were active in each of the weeks
in the late period ($V_{\text{late}} = \bigcap_j V_j$) and examine the
induced subgraphs on $V_{\text{late}}$.
The increase in user count is a consequence of the rapid growth
of Twitter during this period.

We train our link predictor on the new links that occur in the
early networks and validate on the late networks. To explore
the sensitivity of our method on finer timescales (e.g., a week),
we conduct a separate set of experiments in which we train on
new links which occur in a given Week $t$ (e.g., $e \in E_t \setminus E_{t-1}$)
and validate on the new links that occur in Week $t + 1$ (e.g.,
$e \in E_{t+1} \setminus E_t$). Further details are outlined in the next two
subsections.
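
The notion of "new links" used for training and validation can be illustrated with a small sketch. The function name `new_links` and the toy edge lists below are hypothetical and are not taken from the study; links are treated as undirected, as in the reciprocal reply networks.

```python
def new_links(edges_prev, edges_curr):
    """Return the links present in the current snapshot but absent
    from the previous one, i.e. the set difference E_t minus E_{t-1}."""
    # Links are undirected, so each pair is stored as a frozenset.
    prev = {frozenset(e) for e in edges_prev}
    curr = {frozenset(e) for e in edges_curr}
    return curr - prev


# Toy weekly snapshots (hypothetical data, for illustration only).
E_6 = [("a", "b"), ("b", "c")]
E_7 = [("a", "b"), ("c", "b"), ("a", "c")]
E_8 = [("a", "c"), ("c", "d")]

train_targets = new_links(E_6, E_7)   # new links appearing in Week 7
valid_targets = new_links(E_7, E_8)   # new links appearing in Week 8
```

Storing each link as a `frozenset` makes `("b", "c")` and `("c", "b")` compare equal, so a link that merely changes orientation in the data is not counted as new.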

2.2. Similarity indices 

Similarity indices capture the shared characteristics or
contexts of two nodes. We briefly overview the 16 similarity indices
chosen for inclusion in our link predictor, but wish to
emphasize that any number of other similarity indices may be
chosen for inclusion in the evolutionary algorithm. The choice of
which similarity indices to include may largely depend on the
metadata one has about the nodes and interactions, as well as
the size of the network.

^1 Following is a relatively passive activity and the establishment of a link
between such users may misrepresent current attention to information in the
network. Furthermore, follower networks typically do not account for the
"unfriending" problem and the accumulation of dead links in a network can distort
the representation of the true state of the system and spam.

^2 We observe approximately 35% of new links occurring between
individuals connected by a path of length 2.

^3 We also construct reply networks, whereby nodes represent users and
directed, weighted links represent the number of replies sent from one individual
to another during the week under analysis. Reply networks are used in the
computation of the average path weight, one of our similarity indices.

Topological similarity indices may be characterized by local,
quasi-local, or global measures. Since global similarity
measures (i.e., Katz, SimRank, and the Matrix Forest Index) are
computationally laborious for large networks [33], we forgo these
measures in favor of local topological indices, which we describe
in Table 1. For the node similarity, we calculate four indices:
Twitter Id similarity, tweet count similarity, word similarity, and
happiness similarity. We then rescale the computed scores to
range from 0 to 1, inclusive, and store them as $N \times N$ sparse matrices.

In large networks, some similarity indices are
computationally burdensome. One technique for dealing with this problem
is to restrict link prediction to only those nodes with a minimal
path length of 2 or 3 [1, 4]. Given the large size of our networks
and the observation that approximately 35% of new links from
week $t$ to $t + 1$ in our training set occur between individuals with
path length two, we restrict our computations to these user-user
pairs.
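
The restriction to user-user pairs at path length two can be sketched as follows. This is a minimal illustration under the assumption that the network fits in memory as a dictionary of neighbor sets; the helper names `build_adjacency` and `two_hop_pairs` are hypothetical and do not come from the original study.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def two_hop_pairs(adj):
    """Return unordered pairs that share at least one neighbor but are
    not directly linked, i.e. pairs at minimal path length two."""
    pairs = set()
    for nbrs in adj.values():
        # Any two neighbors of the same node are at most two steps apart.
        for u, v in combinations(sorted(nbrs), 2):
            if v not in adj[u]:
                pairs.add((u, v))
    return pairs


# Toy network (hypothetical data).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
adj = build_adjacency(edges)
candidates = two_hop_pairs(adj)
```

Enumerating pairs through shared neighbors avoids scoring all $N^2$ possible pairs, which is the point of the restriction for large, sparse networks.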

We depict frequency plots for the computed similarity indices
in Figure 2. These plots demonstrate that none of the similarity
indices cleanly separates the newly formed "links" (user-user pairs who
are separated by a minimal path of length 2 at $t$ and a path of
length 1 at $t + 1$) from the "duds" (user-user pairs who are separated
by a minimal path of length 2 at $t$ and a path of length greater than 1
at $t + 1$). This lack of separation is one indication that a
predictor which combines information from several indices may
improve link prediction efforts. Figure 2 also reveals that the
manner in which the predictors should be combined is not as
straightforward as one might envision. For example, some
similarity indices, such as Adamic-Adar (Fig. 2) and Resource
Allocation (Fig. 2g), show promising potential for
differentiating links and duds. Other indices, such as Twitter Id similarity
(Fig. 2m), maintain a greater number of duds than links across
all scores. This is a result of the large class imbalance between
the number of potential user-user pairs for new links and the
actual numbers of new links formed, a common occurrence in
large, sparse networks.

2.3. Evolutionary algorithm 

Evolutionary algorithms take inspiration from biological
systems whereby individuals representing candidate solutions
evolve over generational time via selection, reproduction,
mutation, and recombination. In our task, we construct a linear
combination of similarity indices, $S_i$, and use an evolutionary
strategy to evolve the coefficients, $w_i$, in the overall score

$$S = \sum_i w_i S_i. \tag{1}$$

Using the Covariance Matrix Adaptation Evolution Strategy
(CMA-ES) with both rank-1 and rank-$\mu$ updates,^4 we evolve
$\mathbf{w} = (w_1, w_2, \ldots, w_{16}) \in \mathbb{R}^{16}$ over 250 generations [29]. Entries
of $\mathbf{w}$ are initialized to random values between 0 and 1; however,
these values are not constrained during evolution. At each
generation, CMA-ES generates a multivariate Gaussian cloud of
candidate solutions^5 about an individual.^6 The standard
implementation of CMA-ES selects the "best solution" as that which
minimizes fitness. As such, our fitness function (1) ranks the
entries from $S$, (2) selects the user-user pairs exhibiting the top $N$
scores (taken as the top $N$ most likely links), and (3) assesses
the percent of these user-user pairs which do not exhibit a link
in the next week. The best solution is selected as the candidate
with the minimal fitness. We run 100 different initializations
for each of four fitness functions, fitness_20, fitness_200, fitness_2000,
and fitness_20000, where the subscript denotes the top $N$ scoring
user-user pairs (i.e., predicted links). By incorporating fitness
functions which operate at different scales, we investigate the
sensitivity of the link predictor's performance in validation to the
choice of top $N$. A schematic visualization of these steps is provided
in the Appendix (Fig. A1).

Our choice of CMA-ES stems from its efficiency in
finding real valued solutions in noisy landscapes [39]. In
contrast to gradient descent approaches for finding optimal
solutions, CMA-ES does not rely on assumptions of differentiability
or continuity of the fitness landscape. Our method requires
no prior knowledge or heuristics, which is an advantage over
many existing supervised learning methods (e.g., SVM) that
require extensive parameter tuning and kernel selection [29].
Additionally, our method is flexible and allows for any similarity
index to be substituted into or added to the evolutionary
algorithm. Ideally, the transparency of the evolved "best" predictors
will help illustrate possible driving mechanisms behind the
network's evolution.

2.4. Cross referencing links

From the 100 best solutions evolved via CMA-ES for each
of the four fitness functions (i.e., where the top 20, 200, 2000,
or 20000 scores are used to predict future links), we
cross-reference the top $N$ scoring user-user pairs. The user-user pairs
which are most heavily cross-referenced (i.e., links which most
models agree upon) are those for which we predict a link. In
addition to the 400 best evolved predictors, we also feed in
information from the Resource Allocation similarity index when
predicting the top $N \leq 10$ links because of the high performance of this
index for predicting the top 10 or fewer links on training sets.

^4 Briefly, rank-1 updates utilize information about correlations between
generations, which is helpful for evolution with small populations. Rank-$\mu$
updates utilize information from the current generation, which helps speed up
the algorithm for large populations. We refer the interested reader to
https://www.lri.fr/~hansen/cmatutorial110628.pdf for more detail.

^5 We use the default population size of $4 + \lfloor 3 \log(m) \rfloor$, for solutions in $\mathbb{R}^m$,
from Hansen's source code available at
https://www.lri.fr/~hansen/cmaes_inmatlab.html. Increasing the population size did not improve our
results.

^6 We focus on training for a specific week, e.g., new links that occur from
Week 1 to 2, Week 3 to 4, Week 7 to 8, or Week 9 to 10. We also run a set of
experiments whereby a randomly selected week in the early set of networks is
chosen at each generation. Both sets of experiments are presented in the results
and discussion.
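
The scoring, fitness, and cross-referencing steps of Sections 2.3 and 2.4 can be sketched in a few lines. This is an illustrative sketch only: it does not implement CMA-ES itself, the function names and toy numbers are hypothetical, and two indices stand in for the sixteen used in the study. Any minimizing optimizer could, in principle, be handed `fitness_top_n` as its objective.

```python
from collections import Counter


def combined_scores(weights, index_rows):
    """Score S = sum_i w_i * S_i for each candidate pair.
    index_rows[k] holds the similarity index values of pair k."""
    return [sum(w * s for w, s in zip(weights, row)) for row in index_rows]


def top_n(scores, n):
    """Positions of the n highest scores (ties broken by position)."""
    return sorted(range(len(scores)), key=lambda k: -scores[k])[:n]


def fitness_top_n(weights, index_rows, is_link, n):
    """Fraction of the top-n scoring pairs that do NOT become links.
    Lower is better, so the value suits a minimizing optimizer."""
    chosen = top_n(combined_scores(weights, index_rows), n)
    misses = sum(1 for k in chosen if not is_link[k])
    return misses / len(chosen)


def cross_reference(weight_vectors, index_rows, n):
    """Count how many predictors place each pair in their top n and
    return the n pairs receiving the most votes."""
    votes = Counter()
    for w in weight_vectors:
        votes.update(top_n(combined_scores(w, index_rows), n))
    return [pair for pair, _ in votes.most_common(n)]


# Toy data (hypothetical): four candidate pairs, two similarity indices.
index_rows = [[0.9, 0.2], [0.4, 0.1], [0.3, 0.7], [0.1, 0.3]]
is_link = [True, False, True, False]
predictors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fitness_values = [fitness_top_n(w, index_rows, is_link, 2) for w in predictors]
agreed = cross_reference(predictors, index_rows, 2)
```

In this toy case each single-index predictor misses one of its two top pairs, while the equally weighted combination misses none, and the pairs agreed upon by the most predictors are exactly the two true links.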



Topological similarity indices (abbreviation)

Jaccard Index (J)
$J(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$
Measures the probability that a neighbor of u or v is a neighbor of both u and v. This
measurement is a way of characterizing shared content and has been shown to be meaningful
in information retrieval [10].

Adamic-Adar Coefficient (A)
$A(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log(|\Gamma(z)|)}$
Quantifies features shared by nodes u and v and weights rarer features more heavily [11].
Interpreting this in the context of neighborhoods, the Adamic-Adar Coefficient can be used to
characterize neighborhood overlap between nodes u and v, weighting the overlap of smaller
such neighborhoods more heavily.

Common neighbors (C)
$C(u, v) = |\Gamma(u) \cap \Gamma(v)|$
Measures the number of shared neighbors between u and v. Despite the simplicity of this
index, Newman [9] documented that the probability of future links occurring in a collaboration
network was positively correlated with the number of common neighbors.

Average Path Weight (P)
$P(u, v) = \frac{\sum_{p \in \text{paths}^2_{u,v} \cup \text{paths}^3_{u,v}} w_p}{|\text{paths}^2_{u,v}| + |\text{paths}^3_{u,v}|}$
Computes the sum of the minimum weights on the directed paths between u and v divided
by the number of paths between u and v, where only paths of length 2 and 3 are considered
due to the large size of this network. We take $w_p$ to be the minimum weight of the edges in
the path, in the spirit that a path's strength is only as strong as its weakest edge.

Katz (K)
$K = \sum_{n=1}^{\infty} \beta^n A^n$
Computed as such, the Katz index is a global index [12]. This series converges to $(I - \beta A)^{-1} - I$
when $\beta < 1/\max(\lambda(A))$. When $\beta \ll 1$, K approximates the number of common neighbors.
Due to the size of our network and the computational expense of this index, we truncate to $n = 3$.
We set $\beta = 1$ because we are not concerned with convergence and wish to emphasize the number of
paths of length greater than two. Previous observations suggest that individuals who appear
to be connected by a path length of n in Twitter RRNs may actually be connected by a path
of shorter length due to the role of missing data [26].

Preferential Attachment (Pr)
$Pr(u, v) = k_u \times k_v$
Gives higher scores to pairs of nodes for which one or both have high degree. This index
arose from the observation that nodes in some networks acquire new links with a probability
proportional to their degree [9] and from preferential attachment random growth models [34].

Resource Allocation (R)
$R(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{|\Gamma(z)|}$
Considers the amount of a given resource one node has and assumes that each node will
distribute its resource equally among all neighbors [3].

Hub promoted Index (Hp)
$Hp(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\min(k_u, k_v)}$
First proposed to measure the topological overlap of pairs of substrates in metabolic
networks, this index assigns higher scores to links adjacent to hubs since the denominator
depends on the minimum degree of the two users [35].

Hub depressed Index (Hd)
$Hd(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\max(k_u, k_v)}$
When one of the nodes has large degree, the denominator will be larger and thus Hd is
smaller in the case where one of the users is a hub [33].

Leicht-Holme-Newman Index (L)
$L(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{k_u k_v}$
Measures the number of common neighbors relative to the square of their geometric mean.
This index gives high similarities to pairs of nodes that have many common neighbors
compared to the expected number of such neighbors [36].

Salton Index (Sa)
$Sa(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\sqrt{k_u k_v}}$
Measures the number of common neighbors relative to their geometric mean [10].

Sorenson Index (So)
$So(u, v) = \frac{2\,|\Gamma(u) \cap \Gamma(v)|}{k_u + k_v}$
Measures the number of common neighbors relative to their arithmetic mean. This index is
similar to J; however, J counts the number of (unique) nodes in the shared neighborhood.
This index was previously used to establish equal amplitude groups in plant sociology based
on the similarity of species [37].

Individual characteristics similarity indices

Id similarity (I)
$I(u, v) = 1 - \frac{|Id(u) - Id(v)|}{\max\{|Id(a) - Id(b)|\}_{a,b \in V}}$
In 2008, user ids were numbered sequentially and a user's id served as a proxy for the
relative length of time since opening a Twitter account. Id similarity characterizes the extent
to which two individuals adopt Twitter simultaneously.

Tweet count similarity (T)
$T(u, v) = 1 - \frac{|T(u) - T(v)|}{\ldots}$
Tweet count T(u) measures the number of Tweets we have gathered for node u in a given
week. Tweet count similarity quantifies how similar two individuals' tweet counts are, with
1 representing identical tweet counts and 0 representing dissimilar tweet counts.

Happiness similarity (H)
$H(u, v) = 1 - \frac{|h(u) - h(v)|}{\ldots}$
Building on previous work [38], happiness scores (h(u) and h(v)) are computed as the
average of happiness scores for words authored by users u and v during the week of analysis.

Word similarity (W)
$W(u, v) = 1 - \frac{1}{2} \sum_{n=1}^{50000} |f_{u,n} - f_{v,n}|$
From a corpus consisting of the 50,000 most commonly occurring words used in Twitter
from 2008 through 2011 [38], the similarity of words used by u and v is computed by a
modified Hamming distance, where $f_{u,n}$ represents the normalized frequency of word usage
of the nth word by user u. The value of W(u, v) ranges from 0 (dissimilar word usage) to 1
(similar word usage) [26].

Table 1: The sixteen similarity indices chosen for inclusion in the link predictor. We define the neighborhood of node $u$ to be $\Gamma(u)$.
$G = (V, E)$ is a network, consisting of vertices ($V$) and edges ($E$). The degree of node $u$ is represented by $k_u$.
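
Several of the local topological indices in Table 1 depend only on the two neighborhoods and can be computed together. The sketch below assumes an unweighted, undirected network held as a dictionary of neighbor sets and uses the standard textbook forms of these indices; the function name `overlap_indices` and the toy network are hypothetical.

```python
import math


def overlap_indices(adj, u, v):
    """Compute neighborhood-overlap indices for the pair (u, v).
    adj maps each node to the set of its neighbors."""
    gu, gv = adj[u], adj[v]
    shared = gu & gv
    ku, kv = len(gu), len(gv)
    c = len(shared)
    return {
        "C": c,                                   # common neighbors
        "J": c / len(gu | gv),                    # Jaccard
        # Every shared neighbor z has degree >= 2, so log(k_z) > 0.
        "A": sum(1 / math.log(len(adj[z])) for z in shared),
        "R": sum(1 / len(adj[z]) for z in shared),
        "Pr": ku * kv,                            # preferential attachment
        "Hp": c / min(ku, kv),                    # hub promoted
        "Hd": c / max(ku, kv),                    # hub depressed
        "L": c / (ku * kv),                       # Leicht-Holme-Newman
        "Sa": c / math.sqrt(ku * kv),             # Salton
        "So": 2 * c / (ku + kv),                  # Sorenson
    }


# Toy network (hypothetical): a and b share neighbors c and d.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
scores = overlap_indices(adj, "a", "b")
```

Because every candidate pair at path length two has at least one shared neighbor, both nodes have nonzero degree and none of the denominators above can vanish for such pairs.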
[Figure 2 panels: frequency of "Links" and "Duds" plotted against "Score from predictor" for each similarity index. Panels include (d) Paths, (e) mod. Katz, (f) Pref. Att., (g) Resource Allocation, (h) Hub depressed, (i) Hub promoted, (j) LHN, (k) Salton, (l) Sorenson, (m) Twitter Id similarity, (n) Tweet count similarity, (o) Happiness similarity, (p) Word similarity.]



Figure 2: Scores for user-user pairs with path length two in Week 7, which exhibit a link (blue) and which did not (red) in Week 8. For many indices, there are 
more "duds" than "links" for a given score. Indices for which there are "links" scoring higher than "duds" tend to exhibit a large, positive evolved coefficient (e.g., 
Adamic-Adar). 



2.5. Validation measures 

The confusion matrix identifies true positives ("hits"), false 
positives ("false alarms"), false negatives ("misses"), and true 
negatives ("correct rejections"). These will be designated as 
TP, FP, FN, TN, respectively (Fig. 3). From the confusion
matrix, several measures for assessing a classifier can be computed
(Table 2).

The Receiver Operating Characteristic (ROC) curve depicts
the true positive rate (TPR) as a function of the false positive
rate (FPR) [40]. A classification method which randomly
assigns true or false to the presence of future links would, on
average, have TPR equal to FPR. Successful classifiers have
TPR > FPR and this is often quantified by estimating the area
under the curve (AUC) for the ROC. The AUC approximates
the probability that a link predictor will assign a higher score to
user-user pairs who exhibit a link in the next time step than to
user-user pairs who do not exhibit a link in the next time step.
This can be computed using a trapezoidal approximation. In
practice, many researchers approximate
$\text{AUC} \approx \frac{n' + 0.5\,n''}{n}$, where
$n$ represents the number of comparisons, $n'$ represents the
number of times that user-user pairs which receive a new link in the
next time step receive a higher score than randomly selected
user-user pairs which do not have a link in the next time step, and
$n''$ represents the number of times that they have equal scores.

Figure 3: Confusion matrix showing the true positives (TP), false positives
(FP), false negatives (FN), true negatives (TN), predicted positives (P'),
predicted negatives (N'), true state positives (P) and true state negatives (N). As
shown in the diagram, rows and columns sum as follows: P' = TP + FP,
N' = FN + TN, P = TP + FN, and N = FP + TN.

Sensitivity (true positive rate, recall): $TPR = \frac{TP}{TP + FN}$
False positive rate: $FPR = \frac{FP}{FP + TN}$
Accuracy: $ACC = \frac{TP + TN}{TP + FP + FN + TN}$
Specificity (true negative rate): $TNR = \frac{TN}{FP + TN}$
Positive predictive value (precision): $PPV = \frac{TP}{TP + FP}$
Negative predictive value: $NPV = \frac{TN}{TN + FN}$
False discovery rate: $FDR = \frac{FP}{FP + TP}$

Table 2: Measures for quantifying the relative success of a signal detection
method [40]. A classifier may score well in one measure and poorly in another.
The key to choosing a successful link predictor is to focus on the important
indicators of success given by the context of the problem.
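
The sampling approximation of the AUC described above can be sketched as follows. The function name `auc_by_sampling`, the fixed random seed, and the toy scores are assumptions made for illustration; the study's own implementation is not shown in the text.

```python
import random


def auc_by_sampling(link_scores, dud_scores, n, seed=0):
    """Approximate AUC as (n' + 0.5 * n'') / n using n random
    comparisons between a link score and a dud score."""
    rng = random.Random(seed)
    higher = 0  # n': the link scored strictly higher than the dud
    equal = 0   # n'': the link and the dud received the same score
    for _ in range(n):
        s_link = rng.choice(link_scores)
        s_dud = rng.choice(dud_scores)
        if s_link > s_dud:
            higher += 1
        elif s_link == s_dud:
            equal += 1
    return (higher + 0.5 * equal) / n


# Toy scores (hypothetical): perfectly separated, then indistinguishable.
auc_separated = auc_by_sampling([0.9, 0.8], [0.1, 0.2], 1000)
auc_tied = auc_by_sampling([0.5], [0.5], 100)
```

A predictor whose links always outscore its duds yields 1, and one that cannot distinguish them yields 0.5, matching the interpretation of the AUC as the probability that a link receives the higher score.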

For large, sparse networks, the negative class is often much 
larger than the positive class. In our case, the number of new 
links (positive class) is on the order of 10^, whereas the number 
of potential links which do not exhibit future links (negative 
class) is on the order of 10^. Given this imbalance, measures 
such as accuracy, negative predictive value, and specificity
(Table 2) will be very close to 1, even for random link predictors.

As suggested by Wang et al. [4], more emphasis should be
placed on recall and precision due to the large class imbalance
between positives and negatives. We report these individual
measures, as well as the $F_\beta$ score, which incorporates both
precision and recall. The tunable parameter $\beta$ allows for unequal
weighting on recall vs. precision:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}. \tag{2}$$



In some applications, false positives ("false alarms") may be
relatively costless, whereas false negatives ("misses") may pose
an imminent threat. In these cases, recall is much more important
than precision, and setting $\beta > 1$ will weight recall more
heavily in the $F_\beta$ score. In contrast, other applications may
involve scenarios where false positives are costly to explore and a
small number of links about which we are fairly certain
is highly prized. In these cases, one can set $\beta < 1$ to place more
importance on precision.
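A minimal sketch of Equation 2 follows. The function name `f_beta` and the precision and recall values in the example are hypothetical and are chosen only to show how $\beta$ shifts the score.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 when undefined."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical predictor with higher precision than recall.
p, r = 0.4, 0.1
scores = {beta: f_beta(p, r, beta) for beta in (0.5, 1.0, 2.0)}
```

For a predictor whose precision exceeds its recall, as in this example, the score decreases as $\beta$ grows, because larger $\beta$ places more weight on the weaker of the two measures.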

2.6. Exploring the impact of missing data 

During the twelve-week period from September 9, 2008 to
December 1, 2008, we received approximately 40% of all tweets from
Twitter's API service (Table A1). There are therefore both individuals
and interactions that are unaccounted for in our training
and validation periods. Consequently, there are individuals who
are connected by a path of length two in the true network, but
who appear to be connected by a longer path because we have
not captured interactions for intermediaries.

We explore the potential impact of missing tweets on our
predictor by randomly selecting 50% of our observed tweets
and constructing the reciprocal reply subnetworks for Weeks 1
through 12. The evolutionary algorithm trains and validates on
these subnetworks. For clarity, we denote our observed
networks by $G$ and our subnetworks by $G'$. We identify the
percentage of links which are labeled as false positives in $G'$ but
are true positives in $G$. This occurs precisely when our link predictor
suggests a link which is actually correct, but which the
incomplete data set causes to be classified as a false
positive. As such, we are underestimating the success of our
link prediction method. Given a more complete data set, our
results would most likely be better than we report here.
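The bookkeeping behind this check can be sketched as follows. The function names, the representation of links as unordered pairs, and the choice of denominator (all false positives in the subnetwork) are our assumptions for illustration; the toy data are hypothetical.

```python
import random

def subsample_tweets(tweets, fraction=0.5, seed=0):
    """Keep a random fraction of the observed tweets (sampling without replacement)."""
    rng = random.Random(seed)
    k = round(len(tweets) * fraction)
    return rng.sample(list(tweets), k)

def mislabeled_false_positive_rate(predicted, new_links_sub, new_links_obs):
    """Fraction of false positives in the subnetwork that are real links
    in the more complete observed network."""
    false_pos_sub = predicted - new_links_sub
    if not false_pos_sub:
        return 0.0
    return len(false_pos_sub & new_links_obs) / len(false_pos_sub)

# Toy example: links are unordered user-user pairs.
pair = lambda u, v: frozenset((u, v))
predicted = {pair("a", "b"), pair("a", "c"), pair("b", "d"), pair("c", "d")}
new_links_sub = {pair("a", "b")}                  # seen in the subsample
new_links_obs = {pair("a", "b"), pair("a", "c")}  # seen in the observed data
rate = mislabeled_false_positive_rate(predicted, new_links_sub, new_links_obs)
```

In this toy example three predictions count as false positives in the subnetwork, and one of them is a genuine link in the observed network, so one third of the apparent errors are artifacts of the missing data.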

3. Results 

Our overall finding is that the evolved predictor consisting of
all sixteen similarity indices outperformed all other combined
and individual indices on the training data when training
occurred on a given week's RRN.¹

For illustration purposes within the text, we present the
results for $\text{fitness}_{20}$ during training on new links formed from
Week 7 to Week 8 (Fig. 4). We show the average best fitness
throughout 250 generations for each experiment in the
Appendices (Figs. A2-A6).

We will focus primarily on the training which occurred on a
given week's RRN. For example, in Figure 4 the solid black
curve depicting the "all16" predictor shows that while the average
fitness at generation 1 for the 100 candidates was far worse
($\approx 0.65$) than several similarity indices such as Adamic-Adar
($\approx 0.55$), Common neighbors ($\approx 0.55$), and Resource
Allocation ($\approx 0.60$), convergence to a far better set of solutions
occurred within 100 generations ($\approx 0.22$). The combination of the
twelve topological indices outperformed all individual indices,
but was outperformed by the "all16" predictor. This difference is
most pronounced for the top $N = 20$ case; however, this trend
holds true for the other fitness functions (Appendix, Figs. A2-A5).



¹For training on randomly selected weeks in the "early" networks, the
predictors consisting of the 12 topological indices outperformed the combined
predictor, except in the case when the fitness function selected the top 20 scores
(Fig. A6).
Figure 4: Mean best fitness computed from 100 simulations of CMA-ES for
training on the new links that occur in Week 8 (i.e., links present in Week 8 that
were not present in Week 7) using $\text{fitness}_{20}$. The evolutionary algorithm seeks
to minimize fitness (i.e., minimize the proportion of falsely predicted links). We
compare each individual index (shown in color), along with the three evolved
predictors (shown in black): "all16" (all 16 indices), "topo12" (12 topological
indices), and "node4" (4 individual similarity indices).



We present the best solutions produced from each of the
100 CMA-ES runs in the Appendices (Figs. A7-A9). For illustration
purposes, we highlight the results from Week 8, using a
fitness function which selects the top 20 scores as new links,
in Figure 5. Figure 5a shows all 100 best candidate solutions $\mathbf{w}$
which evolved after 250 generations of CMA-ES as horizontal
rows. The $i$th column signifies the coefficient $w_i$ used
in the linear combination of the similarity indices. The color axis
reveals the value of the $i$th coefficient. Several trends are worth
noting here. First, there is considerable variability between the
100 evolved best candidates. Second, despite this variability,
the Adamic-Adar, Common neighbors, Resource Allocation,
Happiness, and Twitter Id similarity columns have many more
positive values than negative. On the other hand, the coefficient
for the Leicht-Holme-Newman index often evolved to a
large negative weight. This signifies that user-user pairs which
had high scores for the indices which evolved large, positive
weights (e.g., Adamic-Adar, Common neighbors, Resource
Allocation, Happiness, and Id similarity) and low scores for the
indices which evolved large, negative weights (e.g., Leicht-Holme-Newman)
were more likely to exhibit a future link.

We also visualize the relative ranking of the indices by their
coefficients in Fig. 5b (and corresponding plots in the Appendices,
Figs. A7-A9). Ordering the coefficients from greatest (most
positive in 1st place) to least (most negative in 16th place)
reveals that Adamic-Adar, Common neighbors, Resource Allocation,
Happiness, and Twitter Id similarity often occupied the
1st-4th rankings (i.e., indices with the largest positive contribution
to the score matrix), whereas LHN was often in 16th place
(the largest negative weight). Other indices showed considerable
variability in their ranking. We explore the implications of
these findings in the discussion.

(b) Frequency plot for rank of index coefficients

Figure 5: (a.) Presentation of the best solutions evolved from each of 100
simulations using $\text{fitness}_{20}$ and the "all16" predictors to predict new links that
occurred from Week 7 to 8. (b.) Frequency plot of ranked coefficients from (a.),
where 1st place represents large, positive coefficients and 16th place represents
large, negative coefficients. Disk size indicates the fraction of times an index
received a given ranking. Adamic-Adar, Happiness similarity, Resource Allocation,
and Twitter Id similarity were the indices most commonly ranked
1st (largest, positive coefficient), and LHN often evolved to the largest, negative
coefficient. This suggests possible mechanisms which may have been driving
the evolution of the network during this time period. J=Jaccard, A=Adamic-Adar,
C=Common neighbors, P=Paths, K=Katz, Pr=Preferential attachment,
R=Resource allocation, Hd=Hub depressed, Hp=Hub promoted, L=Leicht-Holme-Newman,
Sa=Salton, So=Sorenson, I=Twitter Id similarity, T=Tweet
count similarity, H=Happiness similarity, W=Word similarity.
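The ranking tally summarized in the frequency plot can be sketched as below. The function name `rank_frequencies` and the three-index toy data are hypothetical; in our setting each solution would hold 16 coefficients and there would be 100 solutions.

```python
from collections import Counter

def rank_frequencies(solutions, names):
    """For each index, the fraction of runs in which its coefficient
    held each rank (1 = most positive, len(names) = most negative)."""
    counts = {name: Counter() for name in names}
    for w in solutions:
        # Indices ordered from largest to smallest coefficient.
        order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            counts[names[i]][rank] += 1
    runs = len(solutions)
    return {name: {r: c / runs for r, c in ctr.items()}
            for name, ctr in counts.items()}

# Toy example with three indices and two evolved solutions.
names = ["A", "R", "L"]
solutions = [[0.9, 0.5, -0.8], [0.4, 0.7, -0.9]]
freq = rank_frequencies(solutions, names)
```

Here index L is ranked last in every run, while A and R each take first place in half of the runs; the disk sizes in a frequency plot would correspond to these fractions.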
Figure 6: Receiver operating characteristic (ROC) curve for the "all16" predictors evolved
from CMA-ES with $\text{fitness}_{20000}$. $\mathrm{AUC}_{\text{week } 2 \to 3} = 0.723$, $\mathrm{AUC}_{\text{week } 4 \to 5} = 0.721$,
$\mathrm{AUC}_{\text{week } 8 \to 9} = 0.726$, and $\mathrm{AUC}_{\text{week } 10 \to 11} = 0.707$.

The ROC curve demonstrates that the true positive rate is
considerably larger than the false positive rate ($TPR > FPR$)
(Fig. 6 and Appendix, Fig. A11). We find AUC scores greater than
0.7 for all weeks in the validation set. Tuning $\beta$ to one of 0.5,
1, or 2, we find that $F_1$ peaks around top $N \approx$ 10^ (see
Equation 2). F-scores are higher for weeks during which we
received a higher percentage of tweets from the Twitter API
service. For example, $F_{0.5} = 0.203$, $F_1 = 0.177$, $F_2 = 0.142$,
and $F_{0.5} = 0.226$, $F_1 = 0.181$, $F_2 = 0.143$ for links which
occurred from Weeks 8 to 9 and Weeks 10 to 11, respectively.
In Week 5 we received a far smaller percentage of tweets.
F-scores for new links occurring from Weeks 4 to 5 are $F_{0.5} = 0.184$,
$F_1 = 0.152$, $F_2 = 0.128$ (Fig. A10). We present an accompanying
plot for F-scores when training occurred on randomly
selected weeks in the early set in the Appendix (Fig. A10).

Figure 8 depicts the precision of the predicted links as a function
of the top $N$ scoring user-user pairs. High precision is
achieved for the fitness function which operates by selecting
the top 20 scoring user-user pairs, which is often the region
of interest. Precision is lower for predicted links from Week
4 to 5, a week in which we received a very low percentage of
tweets from the Twitter API service, and higher for predicted
links from Week 8 to 9 and Week 10 to 11, weeks for which we
received a higher percentage of tweets from the Twitter API service
(see Table A2). We also compute negative predictive value
(NPV), and find this is consistently close to 1 due to the large
true negative class. Specificity and accuracy are close to 1.0 for
nearly all values of top $N$ links predicted, except for particularly
large $N$ (> 10^). This is due to the large class imbalance of true
negatives (TN), which dominate the numerator and denominator
of these calculations.² The implications of this finding are
more fully addressed in the discussion.

We next investigate the effects of missing data on our predictor,
under the condition that 50% of the tweets have been
removed. We observe that the number of correctly predicted
links is reduced by the missing data, and the proportion of links
which are incorrectly termed "false positive" because they are
actually links in the weekly network containing a more complete
data set is roughly 10% (Appendix, Fig. A12). This result from
bootstrapping suggests that the reported performance of our predictors
is a lower bound, i.e., true precision and recall
are most likely better than we report.
Figure 7: $F_\beta$ scores for each of the validation sets ($W2 \to 3$, $W4 \to 5$, $W8 \to 9$,
$W10 \to 11$) encode information about the performance of the link predictor
with respect to precision and recall. When $\beta = 1$, precision and recall are
weighted equally, $\beta > 1$ weights recall ($TPR = \frac{TP}{TP + FN}$) more heavily, whereas $\beta < 1$
places more importance on precision ($PPV = \frac{TP}{TP + FP}$). Our predictor performs
better with respect to precision and peaks for values on the order of 10^. The
standard $F_1$ score peaks around 10^ and compares favorably with the work
of [17]. The highest $F_\beta$ scores are found for $W10 \to 11$.

Notable works in the area of link prediction have reported
the factor improvement over random link prediction [1, 4]. We
follow suit and compute the factor improvement of our predictor
over a randomly chosen pair of users. The probability that a
randomly chosen pair of individuals who are not connected in
week $i$ become connected in week $i + 1$ is

$$\frac{|\mathrm{Edges}_{\mathrm{new}}|}{\binom{n}{2} - |\mathrm{Edges}_i|},$$

where $n$ is the number of nodes, $|\mathrm{Edges}_i|$ is the number of edges
in week $i$, and $|\mathrm{Edges}_{\mathrm{new}}|$ is the number of new links in week $i + 1$.
There are 44,439 nodes in the validation set and, as a sample calculation,
71,927 edges in Week 7. There are 53,722 new links that
occur from Week 7 to 8. Thus, the probability of a randomly
chosen pair of nodes from Week 7 exhibiting a link in Week 8
is approximately

$$\frac{53{,}722}{\binom{44{,}439}{2} - 71{,}927} \approx 0.0054\%.$$
We observe significant factors of improvement over randomly
selected new links, usually on the order of 10^ for top
$N < 20$ (Fig. 9). We noticed that Resource Allocation outperformed
other similarity indices when used in isolation to select
the top 5 links during training and have included this in the
cross-validation step (Predictor$_{RA}$) for selecting the
top 10 (or fewer) links. We observe that the combined predictor
outperforms indices used in isolation for most choices of top $N$
link prediction.

²Accompanying plots depicting the average precision, negative predictive
value, accuracy, specificity, true positive rate, and false positive rate for varying $N$
from the best predictors evolved from each of 100 CMA-ES runs are presented
in the Appendices (Figs. A13-A16).

Figure 8: Precision ($\frac{TP}{TP + FP}$) for the predicted links in the validation sets ($W2 \to 3$,
$W4 \to 5$, $W8 \to 9$, $W10 \to 11$). High precision is achieved for top $N < 20$,
which is often the region of interest. The precision for predicted links in $W4 \to 5$
is lower than the other weeks and this may be due to missing data for those
weeks (see Table A2).



4. Discussion 

Several studies have suggested that the inclusion of topological
similarity indices, along with node-specific similarity indices,
can greatly enhance link prediction efforts [12-14, 16-20].
Indeed, we find support for this claim in our work with Twitter
reciprocal reply networks. For experiments in which training
occurred on a given week, we find that the combined "all16"
predictor outperforms the topological-only predictor "topo12"
and find that this difference is most pronounced for top $N < 20$.

When training occurred on the early set, whereby a random
week from the early set was selected at each generation,
predictors consisting only of topological indices ("topo12")
performed the best. When training occurred on a finer time step
(e.g., week), the link predictor was able to home in on nuances
specific to the state of the network at that time. This
improved our ability to predict the top $N < 20$ new links.
Training at a coarser scale allows for the detection of indices
exhibiting robust predictive capabilities, less sensitive to
weekly differences. This is evidenced in AUC scores which
are slightly higher for predictors evolved on the early set as
compared to predictors evolved on the weekly training set
(e.g., $\mathrm{AUC}_{\text{week } 8 \to 9, \text{early}} = 0.738$, $\mathrm{AUC}_{\text{week } 8 \to 9, \text{weekly}} = 0.726$, and
$\mathrm{AUC}_{\text{week } 10 \to 11, \text{early}} = 0.725$, $\mathrm{AUC}_{\text{week } 10 \to 11, \text{weekly}} = 0.707$). The
decision to train over a coarser time step (e.g., early set) or at a
finer scale (e.g., weekly sets) will depend on whether the goal is to
maximize recall or precision.



Our measures perform quite well in comparison to those of other
researchers working in the area of link prediction for Twitter.
Rowe, Stankovic, and Alani [17] explore topological and
individual-specific similarity indices (words and topic similarity)
in an effort to predict following behavior. They find an
AUC < 0.6 whereas we find AUC > 0.7 for all experiments.
Yin, Hong, and Davison [18] develop a structure-based link
prediction model and report F-scores on the order of $F = 0.190$
for Twitter follower networks. These networks do not suffer
from incomplete data in the same way that Twitter reciprocal
reply networks do. Our predictor performs comparatively well,
with scores ranging from $F_1 = 0.152$ for validation on new links
occurring from Week 4 to 5, a week for which we obtained
approximately 24% of all tweets, to $F_1 = 0.181$ for validation on
new links occurring from Week 10 to 11, a week for which we
obtained approximately 48% of all tweets.

We have developed a meaningful link predictor for Twitter
reciprocal reply networks, a social subnetwork consisting of
individuals who demonstrate active and ongoing engagement. We
were able to achieve a factor of improvement over random link
selection on the order of 10^ for the top 20 (or fewer) links
predicted and 10^ over several orders of magnitude for the top
$N$ links predicted. Wang et al. [4] examine a social network
constructed from mobile phone call data and find a factor
improvement of approximately $1.5 \times 10^3$. To compare our work,
however, one must standardize for the number of nodes in the
network.³ Upon doing so, we find our factor improvement is an
order of magnitude higher.

One of the most intriguing aspects of this work is the detection
of similarity indices which evolve to have large, positive
weights in our link predictors. Perhaps the most notable similarity
index for which this is the case is the Resource Allocation
index. Resource allocation considers the amount of resource
one node has and assumes that each node will distribute
its resource equally among all neighbors [3]. Considering the
limits to time and attention an individual has, this may be
suggestive of a mechanism by which users limit their interaction in
Twitter RRNs, a result also suggested by Gonçalves et al. [41].

In addition to suggesting that our work is comparable to or
an improvement upon other work which combines measures
via supervised learning, we present a method which is transparent
and transferable. Future work may involve the inclusion
of geospatial data [42] or community structure to predict links.
Efforts to consider the persistence or decay of links over time
could also prove fruitful.
³These researchers report 579,087,610 potential new links and a factor
improvement of 1500. Rescaling the factor improvement for networks of the same
size amounts to computing the probability of a randomly predicted link being
correct.
Figure 9: Factor improvement over a randomly selected user-user pair. Large factor improvements are exhibited for predicting the top N links, with notable
peaks for N < 100. The combined predictor outperforms the Common neighbors, Adamic-Adar, Paths, Katz, and Resource Allocation indices used in isolation over
most choices for the top N links predicted.
(b) Fitness function

Figure A1: (a.) From 100 initializations of $\mathbf{w} \in \mathbb{R}^{16}$ with entries between and 1 representing weights ("individuals" in an evolutionary sense), CMA-ES creates a
multivariate Gaussian cloud of candidate solutions ("population") using information encoded in the covariance matrix. Pre-computed similarity scores for nodes
in the weekly networks are stored in sparse matrices $S_i$. The scores are computed as the linear combination $S = \sum_{i=1}^{16} w_i S_i$. The vector $\mathbf{w}^*$ with the best fitness (i.e.,
producing the least number of falsely predicted links) is selected as the fittest individual and survives to the next generation. Evolution occurs for 250 generations
under one of four fitness criteria: the top 20, top 200, top 2000, or top 20,000 scores used to predict new links.
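The scoring and fitness evaluation described in this caption can be sketched in plain Python as follows. This is a simplified illustration rather than our implementation: sparse matrices are represented as dictionaries keyed by user-user pair, the function names are hypothetical, and the CMA-ES update itself is omitted.

```python
from collections import defaultdict

def combined_scores(weights, index_scores):
    """S = sum_i w_i * S_i, with each sparse S_i stored as {pair: score}."""
    total = defaultdict(float)
    for w, scores in zip(weights, index_scores):
        for pair, s in scores.items():
            total[pair] += w * s
    return dict(total)

def top_n_fitness(weights, index_scores, new_links, n):
    """Fraction of the n highest-scoring pairs that are not new links.

    Lower is better. Simplification: pairs absent from every index are
    never ranked, although their implicit score is zero.
    """
    scores = combined_scores(weights, index_scores)
    # Sort by descending score; break ties by pair for determinism.
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    top = ranked[:n]
    if not top:
        return 1.0
    misses = sum(1 for pair in top if pair not in new_links)
    return misses / len(top)

# Toy example with two indices instead of sixteen.
common_neighbors = {("a", "b"): 2.0, ("a", "c"): 1.0}
resource_alloc = {("a", "b"): 0.5, ("c", "d"): 1.0}
indices = [common_neighbors, resource_alloc]
new_links = {("a", "b")}
fit_good = top_n_fitness([1.0, 1.0], indices, new_links, 1)
fit_bad = top_n_fitness([-1.0, 1.0], indices, new_links, 1)
```

In the toy example, equal positive weights rank the true new link first (fitness 0), whereas a negative weight on the first index pushes a non-link to the top (fitness 1). An evolutionary search over the weight vector would favor the former.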
Week  Start date  # Obsvd. Msgs.  # Total Msgs.  % Obsvd.                 # Replies  % Replies
                  (×10^6)         (×10^6)        (#Obsvd./#Total × 100)   (×10^6)    (#Replies/#Obsvd. × 100)
1     09.09.08    3.14            7.26           43.2                     0.88       28.1
2     09.16.08    3.36            8.31           40.4                     0.90       26.9
3     09.23.08    3.43            8.89           38.6                     0.90       26.2
4     09.30.08    3.33            9.06           36.8                     0.89       26.6
5     10.07.08    2.33            9.38           24.8                     0.64       27.5
6     10.14.08    4.39            9.87           44.4                     1.24       28.3
7     10.21.08    4.70            10.01          47.0                     1.35       28.8
8     10.28.08    5.74            10.34          55.5                     1.64       28.5
9     11.04.08    5.58            11.14          50.1                     1.63       29.3
10    11.11.08    4.70            9.88           47.6                     1.42       30.2
11    11.18.08    5.48            11.34          48.3                     1.67       30.5
12    11.25.08    5.71            11.47          49.8                     1.73       30.2

Table A1: The "observed" messages in our database comprise a fraction of the total number of Twitter messages made during the period of this study
(September 2008 through November 2008). While our feed from the Twitter API remains fairly constant, the total # of tweets grows, thus reducing the % of all
tweets observed in our database. We calculate the total # of messages as the difference between the last message id and the first message id that we observe for
a given week. This provides a reasonable estimate of the number of tweets made per week, as message ids were assigned (by Twitter) sequentially during the
time period of this study. We also report the number of observed messages that are replies to specific messages and the percentage of our observed messages which
constitute replies.
Figure A2: Mean fitness computed from 100 simulations of CMA-ES for training on the new links that occur in Week 2 (i.e., links present in Week 2 that were not
present in Week 1) for each of four fitness functions: (a.) top 20, (b.) top 200, (c.) top 2000, and (d.) top 20,000 scores used to predict new links. We compare each
individual index, along with "all16" (evolved predictor consisting of all 16 indices), "topo12" (evolved predictor consisting of only the 12 topological indices), and
"node4" (evolved predictor consisting of only the 4 node similarity indices). To show detail, the axes are not uniformly scaled between each panel.

Figure A3: Mean fitness computed from 100 simulations of CMA-ES for training on the new links that occur in Week 4 (i.e., links present in Week 4 that were not
present in Week 3) for each of four fitness functions: (a.) top 20, (b.) top 200, (c.) top 2000, and (d.) top 20,000 scores used to predict new links. We compare each
individual index, along with "all16" (evolved predictor consisting of all 16 indices), "topo12" (evolved predictor consisting of only the 12 topological indices), and
"node4" (evolved predictor consisting of only the 4 node similarity indices). To show detail, the axes are not uniformly scaled between each panel.
Figure A4: Mean fitness computed from 100 simulations of CMA-ES for training on the new links that occur in Week 8 (i.e., links present in Week 8 that were not
present in Week 7) for each of four fitness functions: (a.) top 20, (b.) top 200, (c.) top 2000, and (d.) top 20,000 scores used to predict new links. We compare each
individual index, along with "all16" (evolved predictor consisting of all 16 indices), "topo12" (evolved predictor consisting of only the 12 topological indices), and
"node4" (evolved predictor consisting of only the 4 node similarity indices). To show detail, the axes are not uniformly scaled between each panel.
Figure A5: Mean fitness computed from 100 simulations of CMA-ES for training on the new links that occur in Week 10 (i.e., links present in Week 10 that were
not present in Week 9) for each of four fitness functions: (a.) top 20, (b.) top 200, (c.) top 2000, and (d.) top 20,000 scores used to predict new links. We compare
each individual index, along with "all16" (evolved predictor consisting of all 16 indices), "topo12" (evolved predictor consisting of only the 12 topological indices),
and "node4" (evolved predictor consisting of only the 4 node similarity indices). To show detail, the axes are not uniformly scaled between each panel.

Figure A6: Mean fitness computed from 100 simulations of CMA-ES for training on the new links that occur in randomly selected Week t from the "early" weeks
(i.e., links present in Week t that were not present in Week t - 1) for each of four fitness functions: (a.) top 20, (b.) top 200, (c.) top 2000, and (d.) top 20,000
scores used to predict new links. We compare each individual index, along with "all16" (evolved predictor consisting of all 16 indices), "topo12" (evolved predictor
consisting of only the 12 topological indices), and "node4" (evolved predictor consisting of only the 4 node similarity indices). To show detail, the axes are not
uniformly scaled between each panel.
Figure A7: Ranking of the value of the evolved coefficients from each of 100 CMA-ES runs when fitness is based on the percent of correctly predicted links
from the top N scores. Adamic-Adar is the most frequently chosen top ranking (i.e., heavily weighted) index, followed by common neighbors and resource
allocation. The lowest ranking index was LHN. Individual similarity indices, such as happiness, word similarity, Twitter user Id, and Tweet count, were ranked
intermediate. J=Jaccard, A=Adamic-Adar, C=Common neighbors, P=Paths, K=Katz, Pr=Preferential attachment, R=Resource allocation, Hd=Hub depressed,
Hp=Hub promoted, L=Leicht-Holme-Newman, Sa=Salton, So=Sorenson, I=Twitter Id similarity, T=Tweet count similarity, H=Happiness similarity, W=Word
similarity.

Figure A8: Ranking of the value of the evolved coefficients from each of 100 CMA-ES runs when fitness is based on the percent of correctly predicted links
from the top N scores. Adamic-Adar is the most frequently chosen top ranking (i.e., heavily weighted) index, followed by common neighbors and resource
allocation. The lowest ranking index was LHN. Individual similarity indices, such as happiness, word similarity, Twitter user Id, and Tweet count, were ranked
intermediate. J=Jaccard, A=Adamic-Adar, C=Common neighbors, P=Paths, K=Katz, Pr=Preferential attachment, R=Resource allocation, Hd=Hub depressed,
Hp=Hub promoted, L=Leicht-Holme-Newman, Sa=Salton, So=Sorenson, I=Twitter Id similarity, T=Tweet count similarity, H=Happiness similarity, W=Word
similarity.
Figure A9: (a-d) Presentation of the best solutions evolved from each of 100 simulations when fitness is determined by ranking the top N scores, using a randomly
selected "early" week at each generation of CMA-ES ("early" signifies Weeks 1-6). Considerable variability exists between the rows (100 simulations) and this is
most likely due to the last generation of runs ending on different weeks. This coarse-grained approach suggests indices which provide an improvement in any given
week, but not necessarily the most improvement for a given week. (e-f) Frequency plots summarizing the data from (a-d). Adamic-Adar, Common neighbors, and
Resource Allocation evolve to have large, positive weights. LHN often evolves to have a large, negative weight. Individual similarity indices (such as happiness,
word similarity, Twitter user Id, and Tweet count) evolve to have coefficients close to zero, suggesting that at a coarse level, they neither help nor hinder the
link predictor. J=Jaccard, A=Adamic-Adar, C=Common neighbors, P=Paths, K=Katz, Pr=Preferential attachment, R=Resource allocation, Hd=Hub depressed,
Hp=Hub promoted, L=Leicht-Holme-Newman, Sa=Salton, So=Sorenson, I=Twitter Id similarity, T=Tweet count similarity, H=Happiness similarity, W=Word
similarity.
Figure A10: $F_\beta$ scores encode information about the performance of the link
predictor with respect to precision and recall. When $\beta = 1$, precision and recall
are weighted equally, $\beta > 1$ weights recall ($TPR = \frac{TP}{TP + FN}$) more heavily, whereas
$\beta < 1$ places more importance on precision ($PPV = \frac{TP}{TP + FP}$). Our predictor
performs better with respect to precision and peaks for values on the order of
10^. The standard $F_1$ score peaks around 10^ and compares favorably with
the work of [17].
Figure A11: Receiver operating characteristic (ROC) curve for the best topological-only
solutions derived from CMA-ES using a fitness function which selects the top
20,000 scores as new links and evolving in the early (Weeks 1-6) networks.
Validation occurred on the late (Weeks 7-12) set of networks. $\mathrm{AUC}_{\text{week } 7 \to 8} = 0.717$,
$\mathrm{AUC}_{\text{week } 8 \to 9} = 0.738$, $\mathrm{AUC}_{\text{week } 9 \to 10} = 0.733$, $\mathrm{AUC}_{\text{week } 10 \to 11} = 0.725$,
and $\mathrm{AUC}_{\text{week } 11 \to 12} = 0.728$.
Figure A12: The proportion of incorrectly labeled false positives due to missing
data when 50% of our observed tweets were hidden from view and networks
were recreated using this subsample of the data for Week 7 to 8.
Figure A13: Average sensitivity, false positive rate, accuracy, specificity, positive predictive value, negative predictive value, and false discovery rate for the 100
"all16" predictors evolved from fitness which selects the top 20, 200, 2000, or 20,000 scores as new links.

Figure A14: Average sensitivity, false positive rate, accuracy, specificity, positive predictive value, negative predictive value, and false discovery rate for the 100
"topo12" predictors evolved from fitness which selects the top 20, 200, 2000, or 20,000 scores as new links.

Figure A15: Average sensitivity, false positive rate, accuracy, specificity, positive predictive value, negative predictive value, and false discovery rate for the "all16"
best solutions evolved on the early networks (Weeks 1-6) and validated on the late networks (Weeks 7-12), using fitness functions which select the top 20, 200, 2000,
or 20,000 scores as new links.

Figure A16: Average sensitivity, false positive rate, accuracy, specificity, positive predictive value, negative predictive value, and false discovery rate for the "topo12"
best solutions evolved on the early networks (Weeks 1-6) and validated on the late networks (Weeks 7-12), using fitness functions which select the top 20, 200, 2000,
or 20,000 scores as new links.



